Notes

Concept categorization (which data to use, and how to define the task for topic models?)

Data

  • Latest Wikipedia corpus
  • Extracted plain text
  • Only used first 1000 words per document
  • 87,380,300 sentences
  • 1,813,672,600 words, 9,996,018 unique words

topic model training

  • Trained with MALLET
  • 256 topics, 400 iterations, 13 hours training time

word2vec training

  • Skip-gram model trained with gensim
  • 3.5 hours training time
  • Removed words occurring fewer than 50 times --> 386,046 unique words (98 % of the original corpus); a training sketch follows below
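
A minimal sketch of how the two models could be trained, assuming a tokenized plain-text corpus with one sentence per line; the file names below are placeholders, not the actual paths used in the thesis:

    # Topic model: trained with the MALLET command-line tool, roughly:
    #   bin/mallet train-topics --input wikipedia.mallet \
    #       --num-topics 256 --num-iterations 400 --output-state topic-state.gz

    # Word embeddings: skip-gram word2vec in gensim
    import gensim

    sentences = gensim.models.word2vec.LineSentence("wikipedia.tokenized.txt")
    model = gensim.models.Word2Vec(sentences,
                                   sg=1,          # skip-gram instead of CBOW
                                   min_count=50)  # drop words occurring < 50 times
    model.save("embedding.model.skip-gram")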

Setup


In [1]:
%matplotlib notebook

import itertools
import logging
from functools import partial

import gensim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pnd
from sklearn.cluster import *
from sklearn.decomposition import PCA, RandomizedPCA
from sklearn.manifold import TSNE

from knub.thesis.util import *

In [8]:
d = np.array([
    [1.0, 2.0, 3.1],
    [0.5, 1.2, 4.0],
    [-1.0, 2.1, 1.0]
])
pca(d, 2)


Out[8]:
array([[-0.71999136,  0.62490106],
       [-1.36532971, -0.50803249],
       [ 2.08532107, -0.11686857]])
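
The pca helper is imported from knub.thesis.util and its definition is not shown in these notes. Assuming it is a thin wrapper around scikit-learn, it presumably looks roughly like this:

    from sklearn.decomposition import PCA

    def pca(data, n_components):
        # Project the data onto its first n_components principal components
        return PCA(n_components=n_components).fit_transform(data)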

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]:
from IPython.core.display import HTML
HTML("""
<style>
div.text_cell_render p, div.text_cell_render ul, table.dataframe {
font-size:1.3em;
line-height:1.1em;
}
</style>
""")


Out[4]:

Preprocessing

Topic model


In [13]:
# Prepare data in long form

df_topics = pnd.read_csv("../models/topic-models/topic.full.fixed-vocabulary.alpha-1-100.256-400.model.ssv",
                         sep=" ")
df_topics = df_topics.iloc[:, -10:]  # keep only the ten top words of each topic
df_topics.columns = list(range(10))
df_topics["topic"] = df_topics.index
df_topics["topic_name"] = df_topics[0]

df = pnd.melt(df_topics, id_vars=["topic", "topic_name"], var_name="position", value_name="word")
df = df[["word", "topic", "topic_name", "position"]]
df = df.sort_values(by=["topic", "position"]).reset_index(drop=True)
df[df.topic == 0]


Out[13]:
word topic topic_name position
0 would 0 would 0
1 time 0 would 1
2 new 0 would 2
3 first 0 would 3
4 however 0 would 4
5 years 0 would 5
6 could 0 would 6
7 later 0 would 7
8 one 0 would 8
9 made 0 would 9

Word embeddings


In [40]:
WORD2VEC_VECTOR_FILE = "/home/knub/Repositories/master-thesis/models/word-embeddings/GoogleNews-vectors-negative300.bin"
GLOVE_VECTOR_FILE = "/home/knub/Repositories/master-thesis/models/word-embeddings/glove.6B.50d.txt"
CBOW_VECTOR_FILE = "/home/knub/Repositories/master-thesis/models/word-embeddings/embedding.model.cbow"
SKIP_GRAM_VECTOR_FILE = "/home/knub/Repositories/master-thesis/models/word-embeddings/embedding.model.skip-gram"

#vectors_glove = gensim.models.Word2Vec.load_word2vec_format(GLOVE_VECTOR_FILE, binary=False)
#vectors_skip = gensim.models.Word2Vec.load_word2vec_format(SKIP_GRAM_VECTOR_FILE, binary=True)
#vectors_cbow = gensim.models.Word2Vec.load_word2vec_format(CBOW_VECTOR_FILE, binary=True)
vectors_word2vec = gensim.models.Word2Vec.load_word2vec_format(WORD2VEC_VECTOR_FILE, binary=True)
vectors_default = vectors_word2vec

In [42]:
def get_data_frame_from_word_vectors(df_param, vectors):
    # Keep only words present in the embedding vocabulary,
    # then attach each word's embedding vector.
    df_param = df_param[df_param["word"].apply(lambda word: word in vectors)].copy()
    df_param["embeddings"] = df_param["word"].apply(lambda word: vectors[word])
    return df_param

df = get_data_frame_from_word_vectors(df.copy(), vectors_default)
df[df.topic == 0]


Out[42]:
word topic topic_name position embeddings
0 would 0 would 0 [0.0893555, 0.129883, 0.212891, 0.177734, -0.1...
1 time 0 would 1 [-0.0473633, 0.1875, 0.0022583, 0.173828, -0.0...
2 new 0 would 2 [0.0112915, 0.0289307, 0.0834961, -0.0498047, ...
3 first 0 would 3 [0.122559, -0.0893555, 0.0269775, 0.0737305, 0...
4 however 0 would 4 [0.150391, 0.0412598, -0.0654297, 0.102051, -0...
5 years 0 would 5 [-0.126953, 0.208984, -0.106445, 0.0471191, -0...
6 could 0 would 6 [0.123535, 0.0319824, 0.150391, 0.152344, -0.0...
7 later 0 would 7 [0.188477, -0.173828, 0.15332, 0.0556641, 0.23...
8 one 0 would 8 [0.0456543, -0.145508, 0.15625, 0.166016, 0.10...
9 made 0 would 9 [-0.0559082, 0.117676, 0.210938, 0.00836182, 0...

In [43]:
# financial, muslim, teams in sport, atom physics, math
nice_topics = [5, 117, 158, 164, 171]
nice_topics = [0, 7, 236]

df_part = df[df.topic.apply(lambda topic: topic in nice_topics)].copy()
# Show topics of interest
df_tmp = pnd.DataFrame(df_part.groupby("topic")["word"].apply(lambda l: l.tolist()).tolist())
df_tmp.index = nice_topics
df_tmp


Out[43]:
0 1 2 3 4 5 6 7 8 9
0 would time new first however years could later one made
7 village town area population district located local parish road church
236 san california santa francisco diego mexico valley county state city

Topic model in word embedding space

Plot preparation


In [45]:
def plot_topics_in_embedding_space(reduction_method, df_param):
    embeddings = np.array(df_param["embeddings"].tolist())
    X = reduction_method(embeddings)
    df_tmp = df_param.copy()
    df_tmp["x"] = X[:,0]
    df_tmp["y"] = X[:,1]
    df_tmp = df_tmp[df_tmp.topic.apply(lambda topic: topic in nice_topics)]
    colors = {0: "red", 7: "blue", 236: "green", 164: "yellow", 171: "black"}
    plt.figure(figsize=(12, 8))
    plt.scatter(df_tmp.x, df_tmp.y, c=df_tmp.topic.apply(lambda topic: colors[topic]), s=80)
    
    ylim = plt.gca().get_ylim()
    step = (ylim[1] - ylim[0]) / 100
    
    for index, row in df_tmp.iterrows():
        plt.text(row.x, row.y - step, row.word, horizontalalignment='center', verticalalignment='top')

PCA


In [46]:
#plot_topics_in_embedding_space(pca, df)

In [47]:
plot_topics_in_embedding_space(pca, df_part) # third dimensions


t-SNE


In [ ]:
#plot_topics_in_embedding_space(tsne, df)

t-SNE with PCA initialization


In [22]:
plot_topics_in_embedding_space(tsne_with_init_pca, df)
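
Like pca, the tsne and tsne_with_init_pca helpers come from knub.thesis.util and are not shown here. A plausible sketch, assuming they wrap scikit-learn's TSNE and that the PCA-initialized variant simply uses init="pca":

    from sklearn.manifold import TSNE

    def tsne(data, n_components=2):
        # Plain t-SNE with the default (random) initialization
        return TSNE(n_components=n_components).fit_transform(data)

    def tsne_with_init_pca(data, n_components=2):
        # t-SNE initialized from a PCA projection instead of random noise
        return TSNE(n_components=n_components, init="pca").fit_transform(data)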


Findings

In general, the words of a topic do not seem to end up in similar positions in the word embedding space.

  • Specificity: The more specific a word is, the closer it lies to related words in the word embedding space. See the "muslim", "islam", "mohammad" cluster.
  • Ambiguity: Ambiguous words are a particular problem. For example, "current" lies far away from the other physics terms because it has too many meanings; in fact, it is very close to the word "new" (a quick check is sketched below).
  • Context-based similarity: Topic models can assign different similarities between words depending on the context. They are good at finding words that are similar within a given context, which might not be obvious in isolation. Example: "distribution" is not very similar to "function" on its own, but in the company of "mean", "probability", "data", and "random" it is. See also "Exploring the Space of Topic Coherence Measures" by Röder et al.
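
A quick, illustrative check of the ambiguity observation; "voltage" is only an example physics term, and the exact values depend on the loaded vectors:

    print("current-new:     %.3f" % vectors_default.similarity("current", "new"))
    print("current-voltage: %.3f" % vectors_default.similarity("current", "voltage"))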

Word embedding similarity of topics

Avg. pairwise similarity


In [48]:
def average_pairwise_similarity(words, vectors):
    # Mean cosine similarity over all unordered word pairs of the topic
    word_pairs = itertools.combinations(words, 2)
    similarities = [vectors.similarity(word1, word2) for word1, word2 in word_pairs]
    return np.mean(similarities)

def average_top_similarity(words, vectors):
    # For each word, take its highest similarity to any other word of the topic,
    # then average these maxima (permutations are already grouped by the first word)
    word_pairs = itertools.permutations(words, 2)
    similarities = [(word1, vectors.similarity(word1, word2)) for word1, word2 in word_pairs]
    max_similarities = [max(s for w, s in group) for _, group in itertools.groupby(similarities, lambda s: s[0])]
    return np.mean(max_similarities)
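
As a quick illustration of the two measures, they can be compared on the first words of a single topic, e.g. topic 44 ("song", "songs", "album", see below); the exact values depend on the loaded vectors:

    words = df[df.topic == 44]["word"].tolist()[:3]
    print(average_pairwise_similarity(words, vectors_default))
    print(average_top_similarity(words, vectors_default))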

In [49]:
topic_lengths = list(range(2, 11))
def calculate_similarities_for_topic(df_topic, sim_function, vectors):
    words_in_topic = df_topic["word"].tolist()
    
    average_similarities = [sim_function(words_in_topic[:topic_length], vectors)
                            for topic_length in topic_lengths]
    
    return pnd.Series(average_similarities)

def calculate_similarity_matrix(sim_function, vectors):
    def partial_function(df_topic):
        return calculate_similarities_for_topic(df_topic, sim_function, vectors)

    df_similarities = df.groupby("topic").apply(partial_function)
    df_similarities.columns = ["%s-words" % i for i in topic_lengths]
    return df_similarities

In [50]:
df_similarities = calculate_similarity_matrix(average_pairwise_similarity, vectors_default)
df_similarities.mean()


Out[50]:
2-words     0.341394
3-words     0.324044
4-words     0.309233
5-words     0.292517
6-words     0.282898
7-words     0.271561
8-words     0.264227
9-words     0.257879
10-words    0.248975
dtype: float64

In [51]:
means = df_similarities.mean().tolist()
plt.figure(figsize=(12, 8))
plt.scatter(topic_lengths, means, s=80)
plt.title("Avg. word similarity (cosine similarity in WE space) of topics up to the nth word")
plt.xlim(0, 11)
plt.xticks(list(range(1, 12)))
#plt.ylim((0, 0.35))
plt.xlabel("topic length")
plt.ylabel("average similarity")


Out[51]:
[Figure: scatter plot of average word similarity vs. topic length]

Highest-similar topics

For comparison, here are a few standard similarities:

  • king-prince: {{vectors_default.similarity("king", "prince")}}
  • king-queen: {{vectors_default.similarity("king", "queen")}}
  • topic-topics: {{vectors_default.similarity("topic", "topics")}}
  • buy-purchase: {{vectors_default.similarity("buy", "purchase")}}


In [52]:
def show_highest_similar_topics(topic_length, nr_topics=3):
    column = "%s-words" % topic_length
    df_top = df_similarities.sort_values(by=column, ascending=False)[:nr_topics]
    return df_top.join(df_topics)[[column] + list(range(topic_length))]

In [53]:
show_highest_similar_topics(3)


Out[53]:
3-words 0 1 2
topic
44 0.733276 song songs album
198 0.665253 animals animal dog
249 0.657131 store stores mall

In [54]:
show_highest_similar_topics(6)


Out[54]:
6-words 0 1 2 3 4 5
topic
180 0.567653 university college students campus student school
231 0.562334 israel jewish israeli jerusalem palestinian jews
77 0.553442 music opera orchestra symphony composer piano

In [55]:
show_highest_similar_topics(10)


Out[55]:
10-words 0 1 2 3 4 5 6 7 8 9
topic
193 0.497805 greek serbian greece serbia croatian bulgarian croatia bosnia yugoslavia bulgaria
76 0.470207 church bishop catholic saint diocese roman cathedral pope archbishop priest
111 0.444010 texas chicago illinois michigan state city kansas minnesota iowa missouri

Findings

  • In general, the similarity is not very high beyond the first few words, compared with the reference similarities above
  • Again, topics consisting of highly specific words have the highest word-embedding similarity